Efficient Complex Multiplication and Fast Fourier Transform (FFT) Implementation on the ManArray Architecture

ABSTRACT

Efficient computation of complex multiplication results and very efficient fast Fourier transforms (FFTs) are provided. A parallel array VLIW digital signal processor is employed along with specialized complex multiplication instructions and communication operations between the processing elements which are overlapped with computation to provide very high performance operation. Successive iterations of a loop of tightly packed VLIWs are used allowing the complex multiplication pipeline hardware to be efficiently used. In addition, efficient techniques for supporting combined multiply accumulate operations are described.

[0001] This application claims the benefit of U.S. Provisional Application Serial No. 60/103,712 filed Oct. 9, 1998 which is incorporated by reference in its entirety herein.

FIELD OF THE INVENTION

[0002] The present invention relates generally to improvements to parallel processing, and more particularly to methods and apparatus for efficiently calculating the result of a complex multiplication. Further, the present invention relates to the use of this approach in a very efficient FFT implementation on the manifold array (“ManArray”) processing architecture.

BACKGROUND OF THE INVENTION

[0003] The product of two complex numbers x and y is defined to be $z = x_R y_R - x_I y_I + i(x_R y_I + x_I y_R)$, where $x = x_R + i x_I$, $y = y_R + i y_I$ and i is the imaginary unit, the square root of negative one, with $i^2 = -1$. This complex multiplication of x and y is calculated in a variety of contexts, and it has been recognized that it will be highly advantageous to perform this calculation faster and more efficiently.
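The definition above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (four real multiplications, one subtraction and one addition); the function name `complex_multiply` is hypothetical and does not come from the specification.

```python
def complex_multiply(x_r, x_i, y_r, y_i):
    """Return (z_r, z_i) for z = x * y, using four real multiplications."""
    z_r = x_r * y_r - x_i * y_i  # real part: x_R*y_R - x_I*y_I
    z_i = x_r * y_i + x_i * y_r  # imaginary part: x_R*y_I + x_I*y_R
    return z_r, z_i


# (1 + 2i) * (3 + 4i) = -5 + 10i
print(complex_multiply(1, 2, 3, 4))
```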

SUMMARY OF THE INVENTION

[0004] The present invention defines hardware instructions to calculate the product of two complex numbers, each encoded as a pair of 16-bit fixed-point numbers, in two cycles with single cycle pipeline throughput efficiency. The present invention also defines extending a series of multiply complex instructions with an accumulate operation. These special instructions are then used to calculate the FFT of a vector of numbers efficiently.

[0005] A more complete understanding of the present invention, as well as other features and advantages of the invention, will be apparent from the following Detailed Description and the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] FIG. 1 illustrates an exemplary 2×2 ManArray iVLIW processor;

[0007] FIG. 2A illustrates a presently preferred multiply complex instruction, MPYCX;

[0008] FIG. 2B illustrates the syntax and operation of the MPYCX instruction of FIG. 2A;

[0009] FIG. 3A illustrates a presently preferred multiply complex divide by 2 instruction, MPYCXD2;

[0010] FIG. 3B illustrates the syntax and operation of the MPYCXD2 instruction of FIG. 3A;

[0011] FIG. 4A illustrates a presently preferred multiply complex conjugate instruction, MPYCXJ;

[0012] FIG. 4B illustrates the syntax and operation of the MPYCXJ instruction of FIG. 4A;

[0013] FIG. 5A illustrates a presently preferred multiply complex conjugate divide by two instruction, MPYCXJD2;

[0014] FIG. 5B illustrates the syntax and operation of the MPYCXJD2 instruction of FIG. 5A;

[0015] FIG. 6 illustrates hardware aspects of a pipelined multiply complex and its divide by two instruction variant;

[0016] FIG. 7 illustrates hardware aspects of a pipelined multiply complex conjugate, and its divide by two instruction variant;

[0017] FIG. 8 shows an FFT signal flow graph;

[0018] FIGS. 9A-9H illustrate aspects of the implementation of a distributed FFT algorithm on a 2×2 ManArray processor using a VLIW algorithm with MPYCX instructions in a cycle-by-cycle sequence with each step corresponding to operations in the FFT signal flow graph;

[0019] FIG. 9I illustrates how multiple iterations may be tightly packed in accordance with the present invention for a distributed FFT of length four;

[0020] FIG. 9J illustrates how multiple iterations may be tightly packed in accordance with the present invention for a distributed FFT of length two;

[0021] FIGS. 10A and 10B illustrate Kronecker Product examples for use in reference to the mathematical presentation of the presently preferred distributed FFT algorithm;

[0022] FIG. 11A illustrates a presently preferred multiply accumulate instruction, MPYA;

[0023] FIG. 11B illustrates the syntax and operation of the MPYA instruction of FIG. 11A;

[0024] FIG. 12A illustrates a presently preferred sum of 2 products accumulate instruction, SUM2PA;

[0025] FIG. 12B illustrates the syntax and operation of the SUM2PA instruction of FIG. 12A;

[0026] FIG. 13A illustrates a presently preferred multiply complex accumulate instruction, MPYCXA;

[0027] FIG. 13B illustrates the syntax and operation of the MPYCXA instruction of FIG. 13A;

[0028] FIG. 14A illustrates a presently preferred multiply complex accumulate divide by two instruction, MPYCXAD2;

[0029] FIG. 14B illustrates the syntax and operation of the MPYCXAD2 instruction of FIG. 14A;

[0030] FIG. 15A illustrates a presently preferred multiply complex conjugate accumulate instruction, MPYCXJA;

[0031] FIG. 15B illustrates the syntax and operation of the MPYCXJA instruction of FIG. 15A;

[0032] FIG. 16A illustrates a presently preferred multiply complex conjugate accumulate divide by two instruction, MPYCXJAD2;

[0033] FIG. 16B illustrates the syntax and operation of the MPYCXJAD2 instruction of FIG. 16A;

[0034] FIG. 17 illustrates hardware aspects of a pipelined multiply complex accumulate and its divide by two variant; and

[0035] FIG. 18 illustrates hardware aspects of a pipelined multiply complex conjugate accumulate and its divide by two variant.

DETAILED DESCRIPTION

[0036] Further details of a presently preferred ManArray architecture for use in conjunction with the present invention are found in U.S. patent application Ser. No. 08/885,310 filed Jun. 30, 1997, U.S. patent application Ser. No. 08/949,122 filed Oct. 10, 1997, U.S. patent application Ser. No. 09/169,255 filed Oct. 9, 1998, U.S. patent application Ser. No. 09/169,256 filed Oct. 9, 1998, U.S. patent application Ser. No. 09/169,072 filed Oct. 9, 1998, U.S. patent application Ser. No. 09/187,539 filed Nov. 6, 1998, U.S. patent application Ser. No. 09/205,558 filed Dec. 4, 1998, U.S. patent application Ser. No. 09/215,081 filed Dec. 18, 1998, U.S. patent application Ser. No. 09/228,374 filed Jan. 12, 1999, U.S. patent application Ser. No. 09/238,446 filed Jan. 28, 1999, U.S. patent application Ser. No. 09/267,570 filed Mar. 12, 1999, as well as Provisional Application Serial No. 60/092,130 entitled “Methods and Apparatus for Instruction Addressing in Indirect VLIW Processors” filed Jul. 9, 1998, Provisional Application Serial No. 60/103,712 entitled “Efficient Complex Multiplication and Fast Fourier Transform (FFT) Implementation on the ManArray” filed Oct. 9, 1998, Provisional Application Serial No. 60/106,867 entitled “Methods and Apparatus for Improved Motion Estimation for Video Encoding” filed Nov. 3, 1998, Provisional Application Serial No. 60/113,637 entitled “Methods and Apparatus for Providing Direct Memory Access (DMA) Engine” filed Dec. 23, 1998 and Provisional Application Serial No. 60/113,555 entitled “Methods and Apparatus Providing Transfer Control” filed Dec. 23, 1998, respectively, and incorporated by reference herein in their entirety.

[0037] In a presently preferred embodiment of the present invention, a ManArray 2×2 iVLIW single instruction multiple data stream (SIMD) processor 100 shown in FIG. 1 contains a controller sequence processor (SP) combined with processing element-0 (PE0) SP/PE0 101, as described in further detail in U.S. application Ser. No. 09/169,072 entitled “Methods and Apparatus for Dynamically Merging an Array Controller with an Array Processing Element”. Three additional PEs 151, 153, and 155 are also utilized to demonstrate the implementation of efficient complex multiplication and fast Fourier transform (FFT) computations on the ManArray architecture in accordance with the present invention. It is noted that the PEs can also be labeled with their matrix positions as shown in parentheses for PE0 (PE00) 101, PE1 (PE01) 151, PE2 (PE10) 153, and PE3 (PE11) 155.

[0038] The SP/PE0 101 contains a fetch controller 103 to allow the fetching of short instruction words (SIWs) from a 32-bit instruction memory 105. The fetch controller 103 provides the typical functions needed in a programmable processor such as a program counter (PC), branch capability, digital signal processing, EP loop operations, support for interrupts, and also provides the instruction memory management control which could include an instruction cache if needed by an application. In addition, the SIW I-Fetch controller 103 dispatches 32-bit SIWs to the other PEs in the system by means of a 32-bit instruction bus 102.

[0039] In this exemplary system, common elements are used throughout to simplify the explanation, though actual implementations are not so limited. For example, the execution units 131 in the combined SP/PE0 101 can be separated into a set of execution units optimized for the control function, e.g. fixed point execution units, and the PE0 as well as the other PEs 151, 153 and 155 can be optimized for a floating point application. For the purposes of this description, it is assumed that the execution units 131 are of the same type in the SP/PE0 and the other PEs. In a similar manner, SP/PE0 and the other PEs use a five instruction slot iVLIW architecture which contains a very long instruction word memory (VIM) 109 and an instruction decode and VIM controller function unit 107 which receives instructions as dispatched from the SP/PE0's I-Fetch unit 103 and generates the VIM address-and-control signals 108 required to access the iVLIWs stored in the VIM. These iVLIWs are identified by the letters SLAMD in VIM 109. The loading of the iVLIWs is described in further detail in U.S. patent application Ser. No. 09/187,539 entitled “Methods and Apparatus for Efficient Synchronous MIMD Operations with iVLIW PE-to-PE Communication”. Also contained in the SP/PE0 and the other PEs is a common PE configurable register file 127 which is described in further detail in U.S. patent application Ser. No. 09/169,255 entitled “Methods and Apparatus for Dynamic Instruction Controlled Reconfiguration Register File with Extended Precision”.

[0040] Due to the combined nature of the SP/PE0, the data memory interface controller 125 must handle the data processing needs of both the SP controller, with SP data in memory 121, and PE0, with PE0 data in memory 123. The SP/PE0 controller 125 also is the source of the data that is sent over the 32-bit broadcast data bus 126. The other PEs 151, 153, and 155 contain common physical data memory units 123′, 123″, and 123′″ though the data stored in them is generally different as required by the local processing done on each PE. The interface to these PE data memories is also a common design in PEs 1, 2, and 3 and indicated by PE local memory and data bus interface logic 157, 157′ and 157″. Interconnecting the PEs for data transfer communications is the cluster switch 171 more completely described in U.S. patent application Ser. No. 08/885,310 entitled “Manifold Array Processor”, U.S. application Ser. No. 08/949,122 entitled “Methods and Apparatus for Manifold Array Processing”, and U.S. application Ser. No. 09/169,256 entitled “Methods and Apparatus for ManArray PE-to-PE Switch Control”. The interface to a host processor, other peripheral devices, and/or external memory can be done in many ways. The primary mechanism shown for completeness is contained in a direct memory access (DMA) control unit 181 that provides a scalable ManArray data bus 183 that connects to devices and interface units external to the ManArray core. The DMA control unit 181 provides the data flow and bus arbitration mechanisms needed for these external devices to interface to the ManArray core memories via the multiplexed bus interface represented by line 185. A high level view of a ManArray Control Bus (MCB) 191 is also shown.

[0041] All of the above noted patents are assigned to the assignee of the present invention and incorporated herein by reference in their entirety.

[0042] Special Instructions for Complex Multiply

[0043] Turning now to specific details of the ManArray processor as adapted by the present invention, the present invention defines the following special hardware instructions that execute in each multiply accumulate unit (MAU), one of the execution units 131 of FIG. 1 and in each PE, to handle the multiplication of complex numbers:

[0044] MPYCX instruction 200 (FIG. 2A), for multiplication of complex numbers, where the complex product of two source operands is rounded according to the rounding mode specified in the instruction and loaded into the target register. The complex numbers are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCX instruction format is shown in FIG. 2A. The syntax and operation description 210 is shown in FIG. 2B.

[0045] MPYCXD2 instruction 300 (FIG. 3A), for multiplication of complex numbers with the results divided by 2, where the complex product of two source operands is divided by two, rounded according to the rounding mode specified in the instruction, and loaded into the target register. The complex numbers are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCXD2 instruction format is shown in FIG. 3A. The syntax and operation description 310 is shown in FIG. 3B.

[0046] MPYCXJ instruction 400 (FIG. 4A), for multiplication of complex numbers where the second argument is conjugated, where the complex product of the first source operand times the conjugate of the second source operand is rounded according to the rounding mode specified in the instruction and loaded into the target register. The complex numbers are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCXJ instruction format is shown in FIG. 4A. The syntax and operation description 410 is shown in FIG. 4B.

[0047] MPYCXJD2 instruction 500 (FIG. 5A), for multiplication of complex numbers where the second argument is conjugated, with the results divided by 2, where the complex product of the first source operand times the conjugate of the second operand is divided by two, rounded according to the rounding mode specified in the instruction, and loaded into the target register. The complex numbers are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCXJD2 instruction format is shown in FIG. 5A. The syntax and operation description 510 is shown in FIG. 5B.

[0048] All of the above instructions 200, 300, 400 and 500 complete in 2 cycles and are pipelineable. That is, another operation can start executing on the execution unit after the first cycle. All complex multiplication instructions return a word containing the real and imaginary part of the complex product in half words H1 and H0 respectively.

[0049] To preserve maximum accuracy, and provide flexibility to programmers, four possible rounding modes are defined:

[0050] Round toward the nearest integer (referred to as ROUND)

[0051] Round toward 0 (truncate or fix, referred to as TRUNC)

[0052] Round toward infinity (round up or ceiling, the smallest integer greater than or equal to the argument, referred to as CEIL)

[0053] Round toward negative infinity (round down or floor, the largest integer smaller than or equal to the argument, referred to as FLOOR).
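The instruction family and the four rounding modes described above can be modeled with a short Python sketch. Several details here are assumptions made only for illustration, since the text does not fix them: the halfwords are treated as signed Q15 fractions (`frac_bits=15`), ties in ROUND are rounded up, overflow simply wraps to 16 bits, and the names `mpycx`, `pack` and `unpack` are hypothetical helpers rather than part of the instruction set definition.

```python
import math
from fractions import Fraction

# The four rounding modes named in the text, applied to an exact quotient.
ROUNDERS = {
    "ROUND": lambda q: math.floor(q + Fraction(1, 2)),  # nearest; ties up (assumed)
    "TRUNC": math.trunc,   # toward zero
    "CEIL": math.ceil,     # toward +infinity
    "FLOOR": math.floor,   # toward -infinity
}


def unpack(word):
    """Split a 32-bit word into signed halfwords (H1 = real, H0 = imaginary)."""
    def s16(v):
        return v - 0x10000 if v & 0x8000 else v
    return s16((word >> 16) & 0xFFFF), s16(word & 0xFFFF)


def pack(re, im):
    """Pack two signed 16-bit values into one word; overflow wraps (assumed)."""
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)


def mpycx(rx, ry, mode="ROUND", conj=False, div2=False, frac_bits=15):
    """Model MPYCX; conj=True models MPYCXJ, div2=True the divide-by-2 forms."""
    xr, xi = unpack(rx)
    yr, yi = unpack(ry)
    if conj:
        yi = -yi  # conjugate the second operand
    re = xr * yr - xi * yi
    im = xr * yi + xi * yr
    shift = frac_bits + (1 if div2 else 0)  # one extra bit for divide by 2
    rnd = ROUNDERS[mode]
    return pack(rnd(Fraction(re, 1 << shift)), rnd(Fraction(im, 1 << shift)))


# (0.5 + 0.5i) * (0.5 - 0.5i) = 0.5 + 0i in Q15
half = 0x4000
print(hex(mpycx(pack(half, half), pack(half, -half))))
```

Using exact `Fraction` arithmetic before rounding keeps the four modes easy to compare; a hardware implementation would instead select and round bits of the wide intermediate result.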

[0054] Hardware suitable for implementing the multiply complex instructions is shown in FIG. 6 and FIG. 7. These figures illustrate a high level view of the hardware apparatus 600 and 700 appropriate for implementing the functions of these instructions. This hardware capability may be advantageously embedded in the ManArray multiply accumulate unit (MAU), one of the execution units 131 of FIG. 1 and in each PE, along with other hardware capability supporting other MAU instructions. As a pipelined operation, the first execute cycle begins with a read of the source register operands from the compute register file (CRF) shown as registers 603 and 605 in FIG. 6 and as registers 111, 127, 127′, 127″, and 127′″ in FIG. 1. These register values are input to the MAU logic after some operand access delay in halfword data paths as indicated to the appropriate multiplication units 607, 609, 611, and 613 of FIG. 6. The outputs of the multiplication operation units, X_(R)*Y_(R) 607, X_(R)*Y_(I) 609, X_(I)*Y_(R) 611, and X_(I)*Y_(I) 613, are stored in pipeline registers 615, 617, 619, and 621, respectively. The second execute cycle, which can occur while a new multiply complex instruction is using the first cycle execute facilities, begins with using the stored pipeline register values, in pipeline registers 615, 617, 619, and 621, and appropriately adding in adder 625 and subtracting in subtractor 623 as shown in FIG. 6. The add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction. The values generated by the apparatus 600 shown in FIG. 6 contain a maximum precision of calculation which exceeds 16 bits. Consequently, the appropriate bits must be selected and rounded as indicated in the instruction before storing the final results. The selection of the bits and rounding occurs in selection and rounder circuit 627. The two 16-bit rounded results are then stored in the appropriate halfword position of the target register 629 which is located in the compute register file (CRF). The divide by two variant of the multiply complex instruction 300 selects a different set of bits as specified in the instruction through block 627. The hardware 627 shifts each data value right by an additional 1-bit and loads two divided-by-2 rounded and shifted values into each half word position in the target registers 629 in the CRF.
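The two-cycle latency with single-cycle throughput can be illustrated with a small cycle-by-cycle simulation. This is a simplified sketch, not a model of the actual hardware: it omits operand access delay, bit selection and rounding, and the name `run_pipeline` is hypothetical. The tuple `pipe_regs` stands in for pipeline registers 615, 617, 619 and 621.

```python
def run_pipeline(operand_pairs):
    """Simulate the two execute stages; returns (results, cycle_count)."""
    pipe_regs = None            # products latched by the first execute stage
    results = []
    cycles = 0
    pending = list(operand_pairs)
    while pending or pipe_regs is not None:
        cycles += 1
        # Second execute stage: combine the products latched in the last cycle.
        if pipe_regs is not None:
            rr, ri, ir, ii = pipe_regs
            results.append((rr - ii, ri + ir))  # subtract for real, add for imaginary
            pipe_regs = None
        # First execute stage: a new instruction may start in the same cycle.
        if pending:
            (xr, xi), (yr, yi) = pending.pop(0)
            pipe_regs = (xr * yr, xr * yi, xi * yr, xi * yi)
    return results, cycles


# Three back-to-back multiplies finish in 3 + 1 cycles.
print(run_pipeline([((1, 2), (3, 4)), ((0, 1), (0, 1)), ((2, 0), (0, 3))]))
```

In this model N back-to-back complex multiplies complete in N + 1 cycles, which is the sense in which the second stage of one instruction overlaps the first stage of the next.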

[0055] The hardware 700 for the multiply complex conjugate instruction 400 is shown in FIG. 7. The main difference between multiply complex and multiply complex conjugate is in adder 723 and subtractor 725 which swap the addition and subtraction operation as compared with FIG. 6. The results from adder 723 and subtractor 725 still need to be selected and rounded in selection and rounder circuit 727 and the final rounded results stored in the target register 729 in the CRF. The divide by two variant of the multiply complex conjugate instruction 500 selects a different set of bits as specified in the instruction through selection and rounder circuit 727. The hardware of circuit 727 shifts each data value right by an additional 1-bit and loads two divided-by-2 rounded and shifted values into each half word position in the target registers 729 in the CRF.
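The swap of addition and subtraction follows directly from expanding the product with the second operand conjugated. The expansion below uses the same symbols as the definition of the complex product given in the Background; the overbar for conjugation is notation introduced here for illustration.

$$x\,\bar{y} = (x_R + i x_I)(y_R - i y_I) = (x_R y_R + x_I y_I) + i\,(x_I y_R - x_R y_I)$$

The same four real products are formed as for the plain multiply, but the real part is now a sum and the imaginary part a difference.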

[0056] The FFT Algorithm

[0057] The power of indirect VLIW parallelism using the complex multiplication instructions is demonstrated with the following fast Fourier transform (FFT) example. The algorithm of this example is based upon the sparse factorization of a discrete Fourier transform (DFT) matrix. Kronecker-product mathematics is used to demonstrate how a scalable algorithm is created.

[0058] The Kronecker product provides a means to express parallelism using mathematical notation. It is known that there is a direct mapping between different tensor product forms and some important architectural features of processors. For example, tensor matrices can be created in parallel form and in vector form. See J. Granata, M. Conner, R. Tolimieri, The Tensor Product: A Mathematical Programming Language for FFTs and other Fast DSP Operations, IEEE SP Magazine, January 1992, pp. 40-48. The Kronecker product of two matrices is a block matrix with blocks that are copies of the second argument multiplied by the corresponding element of the first argument. Details of an exemplary calculation of matrix vector products

$$y = (I_m \otimes A)\,x$$

[0059] are shown in FIG. 10A. The matrix is block diagonal with m copies of A. If vector x was distributed block-wise in m processors, the operation can be done in parallel without any communication between the processors. On the other hand, the following calculation, shown in detail in FIG. 10B,

$$y = (A \otimes I_m)\,x$$

[0060] requires that x be distributed physically on m processors for vector parallel computation.

[0061] The two Kronecker products are related via the identity

$$I \otimes A = P\,(A \otimes I)\,P^T$$

[0062] where P is a special permutation matrix called stride permutation and $P^T$ is the transpose permutation matrix. The stride permutation defines the required data distribution for a parallel operation, or the communication pattern needed to transform block distribution to cyclic and vice-versa.
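The two Kronecker forms and the stride-permutation identity can be checked numerically with a small pure-Python sketch. The helper names (`kron`, `stride_permutation`, and so on) and the particular index convention chosen for P are assumptions made for illustration; other references may define the stride permutation as the transpose of the one used here.

```python
def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(A, B):
    """Kronecker product: blocks are copies of B scaled by the entries of A."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def stride_permutation(m, k):
    """P with (P v)[b*k + j] = v[j*m + b]: reads v with stride m (assumed convention)."""
    n = m * k
    P = [[0] * n for _ in range(n)]
    for b in range(m):
        for j in range(k):
            P[b * k + j][j * m + b] = 1
    return P


A = [[1, 2], [3, 4]]
m = 3
P = stride_permutation(m, len(A))
lhs = kron(identity(m), A)                                    # block diagonal, m copies of A
rhs = matmul(matmul(P, kron(A, identity(m))), transpose(P))
print(lhs == rhs)
```

The block-diagonal form corresponds to block-wise data distribution, while the permuted form corresponds to cyclic distribution of the same data.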

[0063] The mathematical description of parallelism and data distributions makes it possible to conceptualize parallel programs, and to manipulate them using linear algebra identities and thus better map them onto target parallel architectures. In addition, Kronecker product notation arises in many different areas of science and engineering. The Kronecker product simplifies the expression of many fast algorithms. For example, different FFT algorithms correspond to different sparse matrix factorizations of the Discrete Fourier Transform (DFT), whose factors involve Kronecker products. See Charles F. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, 1992, pp. 78-80.

[0064] The following equation shows a Kronecker product expression of the FFT algorithm, based on the Kronecker product factorization of the DFT matrix:

$$F_n = (F_p \otimes I_m)\, D_{p,m}\, (I_p \otimes F_m)\, P_{n,p}$$

[0065] where:

[0066] n is the length of the transform

[0067] p is the number of PEs

[0068] m=n/p

[0069] The equation is operated on from right to left with the $P_{n,p}$ permutation operation occurring first. The permutation directly maps to a direct memory access (DMA) operation that specifies how the data is to be loaded in the PEs based upon the number of PEs p and length of the transform n.

$$F_n = (F_p \otimes I_m)\, D_{p,m}\, (I_p \otimes F_m)\, P_{n,p}$$

[0070] where $P_{n,p}$ corresponds to DMA loading data with stride p to local PE memories.

[0071] In the next stage of operation all the PEs execute a local FFT of length m=n/p with local data. No communication between PEs is required.

$$F_n = (F_p \otimes I_m)\, D_{p,m}\, (I_p \otimes F_m)\, P_{n,p}$$

[0072] where $(I_p \otimes F_m)$ specifies that all PEs execute a local FFT of length m sequentially, with local data.

[0073] In the next stage, all the PEs scale their local data by the twiddle factors and collectively execute m distributed FFTs of length p. This stage requires inter-PE communications.

$$F_n = (F_p \otimes I_m)\, D_{p,m}\, (I_p \otimes F_m)\, P_{n,p}$$

[0074] where $(F_p \otimes I_m)\, D_{p,m}$ specifies that all PEs scale their local data by the twiddle factors and collectively execute multiple FFTs of length p on distributed data. In this final stage of the FFT computation, a relatively large number m of small distributed FFTs of size p must be calculated efficiently. The challenge is to completely overlap the necessary communications with the relatively simple computational requirements of the FFT.
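The stages of the factorization can be traced in a short pure-Python sketch that applies the four factors from right to left and compares the result with a direct DFT. This is an illustration under stated assumptions: the forward transform uses the kernel exp(-2*pi*i*j*k/n), each "PE" is just a Python list, the local transform is a naive DFT rather than a fast algorithm, and the names `dft` and `factored_fft` are hypothetical.

```python
import cmath


def dft(v):
    """Naive forward DFT, used both as the local transform and as a reference."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def factored_fft(x, p):
    """Apply F_n = (F_p (x) I_m) D_{p,m} (I_p (x) F_m) P_{n,p} from right to left."""
    n = len(x)
    m = n // p
    # P_{n,p}: load with stride p, so PE r holds x[r], x[r+p], x[r+2p], ...
    local = [x[r::p] for r in range(p)]
    # (I_p (x) F_m): every PE transforms its own m points, no communication.
    local = [dft(v) for v in local]
    # D_{p,m}: PE r scales its k-th value by the twiddle factor w_n^(r*k).
    local = [[v[k] * cmath.exp(-2j * cmath.pi * r * k / n) for k in range(m)]
             for r, v in enumerate(local)]
    # (F_p (x) I_m): m distributed transforms of length p across the PEs.
    out = [0j] * n
    for k in range(m):
        column = dft([local[r][k] for r in range(p)])
        for s in range(p):
            out[s * m + k] = column[s]
    return out


x = [complex(k, -2 * k + 1) for k in range(16)]
err = max(abs(a - b) for a, b in zip(factored_fft(x, 4), dft(x)))
print(err < 1e-9)
```

Only the last stage mixes data held by different PEs, which is why it is the stage where communication must be overlapped with computation.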

[0075] The sequence of illustrations of FIGS. 9A-9H outlines the ManArray distributed FFT algorithm using the indirect VLIW architecture, the multiply complex instructions, and operating on the 2×2 ManArray processor 100 of FIG. 1. The signal flow graph for the small FFT is shown in FIG. 8 and also shown in the right-hand side of FIGS. 9A-9H. In FIG. 8, the operation for a 4 point FFT is shown where each PE executes the operations shown on a horizontal row. The operations occur in parallel on each vertical time slice of operations as shown in the signal flow graph figures in FIGS. 9A-9H. The VLIW code is displayed in a tabular form in FIGS. 9A-9H that corresponds to the structure of the ManArray architecture and the iVLIW instruction. The columns of the table correspond to the execution units available in the ManArray PE: Load Unit, Arithmetic Logic Unit (ALU), Multiply Accumulate Unit (MAU), Data Select Unit (DSU) and the Store Unit. The rows of the table can be interpreted as time steps representing the execution of different iVLIW lines.

[0076] The technique shown is a software pipeline implemented approach with iVLIWs. In FIGS. 9A-9I, the tables show the basic pipeline for PE3 155. FIG. 9A represents the input of the data X and its corresponding twiddle factor W by loading them from the PE's local memories, using the load indirect (Lii) instruction. FIG. 9B illustrates the complex arguments X and W which are multiplied using the MPYCX instruction 200, and FIG. 9C illustrates the communications operation between PEs, using a processing element exchange (PEXCHG) instruction. Further details of this instruction are found in U.S. application Ser. No. 09/169,256 entitled “Methods and Apparatus for ManArray PE-PE Switch Control” filed Oct. 9, 1998. FIG. 9D illustrates the local and received quantities being added or subtracted (depending upon the processing element, where for PE3 a subtract (sub) instruction is used). FIG. 9E illustrates the result being multiplied by −i on PE3, using the MPYCX instruction. FIG. 9F illustrates another PE-to-PE communications operation where the previous product is exchanged between the PEs, using the PEXCHG instruction. FIG. 9G illustrates the local and received quantities being added or subtracted (depending upon the processing element, where for PE3 a subtract (sub) instruction is used). FIG. 9H illustrates the step where the results are stored to local memory, using a store indirect (sii) instruction.

[0077] The code for PEs 0, 1, and 2 is very similar; the two subtractions in the arithmetic logic unit in steps 9D and 9G are substituted by additions or subtractions in the other PEs as required by the algorithm displayed in the signal flow graphs. To achieve that capability and the distinct MPYCX operation in FIG. 9E shown in these figures, synchronous MIMD capability is required as described in greater detail in U.S. patent application Ser. No. 09/187,539 filed Nov. 6, 1998 and incorporated by reference herein in its entirety. By appropriate packing, a very tight software pipeline can be achieved as shown in FIG. 9I for this FFT example using only two VLIWs.
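The sequence of multiply, exchange, and add or subtract steps can be mimicked in Python for a length-4 distributed FFT with one value per PE. The sketch below is a reconstruction under assumptions, since the signal flow graph of FIG. 8 is not reproduced in the text: it uses a standard radix-2 decimation-in-frequency arrangement, pairs PEs across one hypercube dimension per exchange, computes subtractions as received minus local, and leaves the outputs in bit-reversed order. The function name `distributed_fft4` is hypothetical.

```python
def distributed_fft4(x, w=(1, 1, 1, 1)):
    """One length-4 FFT spread over four PEs; list index = PE number."""
    # FIG. 9B step: each PE multiplies its input by its twiddle factor.
    a = [x[pe] * w[pe] for pe in range(4)]
    # FIG. 9C/9D steps: exchange with the partner PE, then add or subtract.
    recv = [a[pe ^ 2] for pe in range(4)]             # partners 0<->2 and 1<->3 (assumed)
    b = [a[pe] + recv[pe] if pe < 2 else recv[pe] - a[pe] for pe in range(4)]
    # FIG. 9E step: only PE3 multiplies by -i; the other PEs multiply by 1.
    b = [b[pe] * (-1j if pe == 3 else 1) for pe in range(4)]
    # FIG. 9F/9G steps: second exchange, then add or subtract again.
    recv = [b[pe ^ 1] for pe in range(4)]             # partners 0<->1 and 2<->3 (assumed)
    c = [b[pe] + recv[pe] if pe % 2 == 0 else recv[pe] - b[pe] for pe in range(4)]
    return c  # PE0..PE3 hold X[0], X[2], X[1], X[3] in this arrangement


x = [1 + 2j, -3 + 0.5j, 2 - 1j, 0.25 + 4j]
print(distributed_fft4(x))
```

In this arrangement PE3 subtracts in both add/subtract steps and is the only PE that multiplies by −i, which matches the description of the PE3 pipeline; the behavior of the other PEs is inferred.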

[0078] In the steady state, as can be seen in FIG. 9I, the Load, ALU, MAU, and DSU units are fully utilized in the two VLIWs while the store unit is used half of the time. This high utilization rate using two VLIWs leads to very high performance. For example, a 256-point complex FFT can be accomplished in 425 cycles on a 2×2 ManArray.

[0079] As can be seen in the above example, this implementation accomplishes the following:

[0080] An FFT butterfly of length 4 can be calculated and stored every two cycles, using four PEs.

[0081] The communication requirement of the FFT is completely overlapped by the computational requirements of this algorithm.

[0082] The communication is along the hypercube connections that are available as a subset of the connections available in the ManArray interconnection network.

[0083] The steady state of this algorithm consists of only two VLIW lines (the source code is two VLIW lines long).

[0084] All execution units except the Store unit are utilized all the time, which leads us to conclude that this implementation is optimal for this architecture.

[0085] Problem Size Discussion

[0086] The equation:

$$F_n = (F_p \otimes I_m)\, D_{p,m}\, (I_p \otimes F_m)\, P_{n,p}$$

[0087] where:

[0088] n is the length of the transform,

[0089] p is the number of PEs, and

[0090] m=n/p

[0091] is parameterized by the length of the transform n and the number of PEs, where m=n/p relates to the size of local memory needed by the PEs. For a given power-of-2 number of processing elements and a sufficient amount of available local PE memory, distributed FFTs of size p can be calculated on a ManArray processor since only hypercube connections are required. The hypercube of p or fewer nodes is a proper subset of the ManArray network. When p is a multiple of the number of processing elements, each PE emulates the operation of more than one virtual node. Therefore, any size of FFT problem can be handled using the above equation on any size of ManArray processor.

[0092] For direct execution, in other words, no emulation of virtual PEs, on a ManArray of size p, we need to provide a distributed FFT algorithm of equal size. For p=1, it is the sequential FFT. For p=2, the FFT of length 2 is the butterfly:

$$Y_0 = X_0 + w\,X_1, \text{ and}$$

$$Y_1 = X_0 - w\,X_1$$

[0093] where $X_0$ and $Y_0$ reside in or must be saved in the local memory of PE0 and $X_1$ and $Y_1$ on PE1, respectively. The VLIWs in PE0 and PE1 in a 1×2 ManArray processor (p=2) that are required for the calculation of multiple FFTs of length 2 are shown in FIG. 9J which shows that two FFT results are produced every two cycles using four VLIWs.
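For p=2 the butterfly can be sketched as two PEs that each hold one input and exchange one value. The division of work shown here is an assumption made for illustration (FIG. 9J is not reproduced in the text): PE1 forms the product w·X1 locally, the PEs swap values, and each PE then performs a single add or subtract. The name `butterfly2` is hypothetical.

```python
def butterfly2(x0, x1, w):
    """Length-2 FFT butterfly split across two PEs; returns (Y0 on PE0, Y1 on PE1)."""
    local0 = x0            # PE0 keeps X0 unchanged
    local1 = w * x1        # PE1 scales X1 by the twiddle factor
    recv0, recv1 = local1, local0   # one exchange between the two PEs
    y0 = local0 + recv0    # PE0: Y0 = X0 + w*X1
    y1 = recv1 - local1    # PE1: Y1 = X0 - w*X1
    return y0, y1


print(butterfly2(1 + 1j, 2 - 1j, -1j))
```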

[0094] Extending Complex Multiplication

[0095] It is noted that in the two-cycle complex multiplication hardware described in FIGS. 6 and 7, the addition and subtraction blocks 623, 625, 723, and 725 operate in the second execution cycle. By including the MPYCX, MPYCXD2, MPYCXJ, and MPYCXJD2 instructions in the ManArray MAU, one of the execution units 131 of FIG. 1, the complex multiplication operations can be extended. The ManArray MAU also supports multiply accumulate operations (MACs) as shown in FIGS. 11A and 12A for use in general digital signal processing (DSP) applications. A multiply accumulate instruction (MPYA) 1100 as shown in FIG. 11A, and a sum two product accumulate instruction (SUM2PA) 1200 as shown in FIG. 12A, are defined as follows.

[0096] In the MPYA instruction 1100 of FIG. 11A, the product of source registers Rx and Ry is added to target register Rt. The word multiply form of this instruction multiplies two 32-bit values producing a 64-bit result which is added to a 64-bit odd/even target register. The dual halfword form of MPYA instruction 1100 multiplies two pairs of 16-bit values producing two 32-bit results: one is added to the odd 32-bit word, the other is added to the even 32-bit word of the odd/even target register pair. Syntax and operation details 1110 are shown in FIG. 11B. In the SUM2PA instruction 1200 of FIG. 12A, the product of the high halfwords of source registers Rx and Ry is added to the product of the low halfwords of Rx and Ry and the result is added to target register Rt and then stored in Rt. Syntax and operation details 1210 are shown in FIG. 12B.
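The dual halfword MPYA and SUM2PA operations can be sketched as follows. The sketch makes assumptions the text does not settle: halfwords are treated as signed 16-bit integers, the high-halfword product is assigned to the odd word and the low-halfword product to the even word, and accumulator overflow is not modeled. The function names are hypothetical stand-ins for the instructions.

```python
def halves(word):
    """Split a 32-bit word into signed 16-bit halfwords (H1, H0)."""
    def s16(v):
        return v - 0x10000 if v & 0x8000 else v
    return s16((word >> 16) & 0xFFFF), s16(word & 0xFFFF)


def mpya_dual_halfword(rt_odd, rt_even, rx, ry):
    """Two 16x16 products, each accumulated into one word of the odd/even pair."""
    x1, x0 = halves(rx)
    y1, y0 = halves(ry)
    # Which product goes to which word is an assumption for this sketch.
    return rt_odd + x1 * y1, rt_even + x0 * y0


def sum2pa(rt, rx, ry):
    """Sum of the high and low halfword products, accumulated into Rt."""
    x1, x0 = halves(rx)
    y1, y0 = halves(ry)
    return rt + x1 * y1 + x0 * y0


rx = (3 << 16) | 4   # halfwords H1 = 3, H0 = 4
ry = (5 << 16) | 6   # halfwords H1 = 5, H0 = 6
print(mpya_dual_halfword(100, 200, rx, ry), sum2pa(10, rx, ry))
```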

[0097] Both MPYA and SUM2PA generate the accumulate result in the second cycle of the two-cycle pipeline operation. By merging MPYCX, MPYCXD2, MPYCXJ, and MPYCXJD2 instructions with MPYA and SUM2PA instructions, the hardware supports the extension of the complex multiply operations with an accumulate operation. The mathematical operation is defined as: $Z_T = Z_R + X_R Y_R - X_I Y_I + i(Z_I + X_R Y_I + X_I Y_R)$, where $X = X_R + i X_I$, $Y = Y_R + i Y_I$ and i is the imaginary unit, the square root of negative one, with $i^2 = -1$. This complex multiply accumulate is calculated in a variety of contexts, and it has been recognized that it will be highly advantageous to perform this calculation faster and more efficiently.
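The complex multiply accumulate defined above can be sketched on (real, imaginary) pairs. Rounding, the divide-by-2 variants and 16-bit packing are left out to keep the arithmetic visible, and the name `mpycxa` with its `conj` flag is a hypothetical stand-in for the accumulate instructions, not their actual syntax.

```python
def mpycxa(z, x, y, conj=False):
    """Return z + x*y (or z + x*conj(y)) on (real, imaginary) tuples."""
    z_r, z_i = z
    x_r, x_i = x
    y_r, y_i = y
    if conj:
        y_i = -y_i  # conjugate the second operand
    return (z_r + x_r * y_r - x_i * y_i,
            z_i + x_r * y_i + x_i * y_r)


# Accumulate a short sum of complex products, as in a dot product.
acc = (0, 0)
for x, y in [((1, 2), (3, 4)), ((0, 1), (0, 1))]:
    acc = mpycxa(acc, x, y)
print(acc)
```

Chaining the accumulate form this way is what a sum of complex products, such as a filter tap sum or correlation, would look like when expressed with these operations.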

[0098] For this purpose, an MPYCXA instruction 1300 (FIG. 13A), anMPYCXAD2 instruction 1400 (FIG. 14A), an MPYCXJA instruction 1500 (FIG.15A), and an MPYCXJAD2 instruction 1600 (FIG. 16A) define the specialhardware instructions that handle the multiplication with accumulate forcomplex numbers. The MPYCXA instruction 1300, for multiplication ofcomplex numbers with accumulate is shown in FIG. 13. Utilizing thisinstruction, the accumulated complex product of two source operands isrounded according to the rounding mode specified in the instruction andloaded into the target register. The complex numbers are organized inthe source register such that halfword H1 contains the real componentand halfword H0 contains the imaginary component. The MPYCXA instructionformat is shown in FIG. 13A. The syntax and operation description 1310is shown in FIG. 13B.

[0099] The MPYCXAD2 instruction 1400, for multiplication of complex numbers with accumulate, with the results divided by two, is shown in FIG. 14A. Utilizing this instruction, the accumulated complex product of two source operands is divided by two, rounded according to the rounding mode specified in the instruction, and loaded into the target register. The complex numbers are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCXAD2 instruction format is shown in FIG. 14A. The syntax and operation description 1410 is shown in FIG. 14B.

[0100] The MPYCXJA instruction 1500, for multiplication of complex numbers with accumulate where the second argument is conjugated, is shown in FIG. 15A. Utilizing this instruction, the accumulated complex product of the first source operand times the conjugate of the second source operand is rounded according to the rounding mode specified in the instruction and loaded into the target register. The complex numbers are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCXJA instruction format is shown in FIG. 15A. The syntax and operation description 1510 is shown in FIG. 15B.

[0101] The MPYCXJAD2 instruction 1600, for multiplication of complex numbers with accumulate where the second argument is conjugated, with the results divided by two, is shown in FIG. 16A. Utilizing this instruction, the accumulated complex product of the first source operand times the conjugate of the second operand is divided by two, rounded according to the rounding mode specified in the instruction, and loaded into the target register. The complex numbers are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCXJAD2 instruction format is shown in FIG. 16A. The syntax and operation description 1610 is shown in FIG. 16B.
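The four accumulate forms differ only in whether the second operand is conjugated and whether the accumulated result is halved. A hedged sketch of that relationship, using exact arithmetic on (real, imaginary) pairs, is given below. The function name and its two flags are hypothetical, and rounding and halfword packing are deliberately left out.

```python
from fractions import Fraction


def mpycx_accumulate(z, x, y, conjugate=False, divide_by_two=False):
    """Model the arithmetic of MPYCXA, MPYCXAD2, MPYCXJA and MPYCXJAD2.

    z, x, y are (real, imag) pairs of integers; rounding is not modeled.
    """
    x_r, x_i = x
    y_r, y_i = y
    if conjugate:                 # MPYCXJA / MPYCXJAD2: conjugate the second operand
        y_i = -y_i
    real = z[0] + x_r * y_r - x_i * y_i
    imag = z[1] + x_r * y_i + x_i * y_r
    if divide_by_two:             # MPYCXAD2 / MPYCXJAD2: halve the accumulated result
        real, imag = Fraction(real, 2), Fraction(imag, 2)
    return real, imag
```

With Z = 1+2i, X = 3+4i and Y = 5+6i, the plain form gives −8+40i and the conjugated form gives 40+4i; the divide-by-two forms give half of each.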

[0102] All of the above instructions 1100, 1200, 1300, 1400, 1500 and 1600 complete in two cycles and are pipelineable. That is, another operation can start executing on the execution unit after the first cycle. All complex multiplication instructions 1300, 1400, 1500 and 1600 return a word containing the real and imaginary parts of the complex product in half words H1 and H0, respectively.

[0103] To preserve maximum accuracy, and provide flexibility to programmers, the same four rounding modes specified previously for MPYCX, MPYCXD2, MPYCXJ, and MPYCXJD2 are used in the extended complex multiplication with accumulate.
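One way to picture four rounding modes applied when low-order bits of a fixed-point result are discarded is sketched below. The mode names follow the list recited in claim 6 (nearest, toward zero, toward infinity, toward negative infinity), but the string labels, the function name, and the treatment of exact ties in the nearest mode (rounded upward here) are assumptions, not details taken from the instruction definitions.

```python
MODES = ("nearest", "zero", "pos_inf", "neg_inf")


def round_shift(value, shift, mode):
    """Drop `shift` (>= 1) low-order bits of a signed integer under a rounding mode.

    Assumption: "nearest" resolves exact ties toward +infinity.
    """
    if mode not in MODES:
        raise ValueError(f"unknown rounding mode: {mode}")
    floor = value >> shift                 # arithmetic shift rounds toward -infinity
    remainder = value - (floor << shift)   # discarded bits, 0 <= remainder < 2**shift
    if mode == "pos_inf":
        round_up = remainder > 0
    elif mode == "zero":
        round_up = remainder > 0 and value < 0
    elif mode == "nearest":
        round_up = remainder >= (1 << (shift - 1))
    else:                                  # "neg_inf"
        round_up = False
    return floor + 1 if round_up else floor
```

For instance, dropping one bit from −3 (that is, −1.5) yields −2 when rounding toward negative infinity and −1 in the other three modes under the tie rule assumed here.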

[0104] Hardware 1700 and 1800 for implementing the multiply complex with accumulate instructions is shown in FIG. 17 and FIG. 18, respectively. These figures illustrate the high level view of the hardware 1700 and 1800 appropriate for these instructions. The important changes to note between FIG. 17 and FIG. 6 and between FIG. 18 and FIG. 7 are in the second stage of the pipeline where the two-input adder blocks 623, 625, 723, and 725 are replaced with three-input adder blocks 1723, 1725, 1823, and 1825. Further, two new half word source operands are used as inputs to the operation. The Rt.H1 1731 (1831) and Rt.H0 1733 (1833) values are properly aligned and selected by multiplexers 1735 (1835) and 1737 (1837) as inputs to the new adders 1723 (1823) and 1725 (1825). For the appropriate alignment, Rt.H1 is shifted right by 1 bit and Rt.H0 is shifted left by 15 bits. The add/subtract blocks 1723 (1823) and 1725 (1825) operate on the input data and generate the outputs as shown. The add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction. The results are rounded and bits 30-15 of both 32-bit results are selected 1727 (1827) and stored in the appropriate half word of the target register 1729 (1829) in the CRF. It is noted that the multiplexers 1735 (1835) and 1737 (1837) select the zero input, indicated by the ground symbol, for the non-accumulate versions of the complex multiplication series of instructions.
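The second-stage datapath described above can be approximated in software as shown below. This is a sketch under stated assumptions rather than a description of FIG. 17 or FIG. 18: the function names are hypothetical, the rounding step is modeled as adding half of the least significant kept bit (one possible round-to-nearest), and the adders are treated as unbounded integers instead of 32-bit hardware. The alignment shifts place both Rt halfwords at bit positions 30-15, matching the field selected from each sum.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def unpack(word):
    """Return (H1, H0) of a 32-bit word as signed 16-bit integers."""
    return to_signed(word >> 16, 16), to_signed(word, 16)


def mpycx_datapath(rx, ry, rt=0, accumulate=False):
    """Two-stage model: four multiplies, then three-input add and bit select."""
    x_r, x_i = unpack(rx)
    y_r, y_i = unpack(ry)
    # Stage 1: four 16x16 multipliers feed the pipeline registers.
    rr, ii, ri, ir = x_r * y_r, x_i * y_i, x_r * y_i, x_i * y_r
    # Stage 2: the multiplexers pass the aligned Rt halfwords, or zero for
    # the non-accumulate instruction forms.
    acc_r = acc_i = 0
    if accumulate:
        t_r, t_i = unpack(rt)
        acc_r = (t_r << 16) >> 1   # Rt.H1 in bits 31-16, shifted right 1 -> bits 30-15
        acc_i = t_i << 15          # Rt.H0 in bits 15-0, shifted left 15 -> bits 30-15
    real32 = acc_r + rr - ii       # three-input add/subtract
    imag32 = acc_i + ri + ir       # three-input add
    # Round (assumed: add half of the kept LSB) and select bits 30-15.
    half = 1 << 14
    real16 = ((real32 + half) >> 15) & 0xFFFF
    imag16 = ((imag32 + half) >> 15) & 0xFFFF
    return (real16 << 16) | imag16
```

Reading the halfwords as 16-bit fractions, `mpycx_datapath(0x40004000, 0x40000000, rt=0x20001000, accumulate=True)` computes (0.5+0.5i)(0.5) + (0.25+0.125i) = 0.5+0.375i, which packs to 0x40003000.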

[0105] While the present invention has been disclosed in the context of various aspects of presently preferred embodiments, it will be recognized that the invention may be suitably applied to other environments consistent with the claims which follow.

We claim:
 1. An apparatus for the efficient processing of complex multiplication computations, the apparatus comprising: at least one controller sequence processor (SP); a memory for storing process control instructions; a first multiply complex numbers instruction stored in the memory and operative to control the PEs to carry out a multiplication operation involving a pair of complex numbers; and hardware for implementing the first multiply complex numbers instruction.
 2. The apparatus of claim 1 further comprising a plurality of processing elements (PEs) interconnected with said SP and arranged in an N×N array interconnected in a manifold array interconnection network.
 3. The apparatus of claim 1 wherein the first multiply complex instruction completes execution in 2 cycles.
 4. The apparatus of claim 1 wherein the first multiply complex instruction is tightly pipelineable.
 5. The apparatus of claim 1 wherein each complex number is stored as a word, each word comprising a first half word and a second half word, with a real component of each complex number being stored as the first half word and an imaginary component of each complex number being stored as the second half word.
 6. The apparatus of claim 1 wherein the first multiply complex instruction includes a plurality of rounding modes, the rounding modes including: rounding toward a nearest integer; rounding toward zero; rounding toward infinity; and rounding toward negative infinity.
 7. The apparatus of claim 1 wherein the first multiply complex numbers instruction is one of the following group of instructions: a multiply complex numbers instruction (MPYCX), a multiply complex numbers instruction (MPYCXJ) operative to carry out the multiplication of a pair of complex numbers where an argument is conjugated, a multiply complex numbers instruction (MPYCXD2) operative to carry out the multiplication of a pair of complex numbers with a result divided by two, and a multiply complex numbers instruction (MPYCXJD2) operative to carry out the multiplication of a pair of complex numbers where an argument is conjugated with a result divided by two.
 8. The apparatus of claim 1 further comprising a multiply accumulate unit including the memory for storing the first multiply complex numbers instruction.
 9. The apparatus of claim 8 wherein the multiply accumulate unit operates in response to a multiply accumulate instruction (MPYA) to extend a multiplication operation with an accumulate operation.
 10. The apparatus of claim 8 wherein the multiply accumulate unit operates in response to a sum two product accumulate instruction (SUM2PA) to extend two multiplication operations with an accumulate operation.
 11. The apparatus of claim 9 wherein the multiply accumulate unit operates in response to a multiply complex with accumulate instruction (MPYCXA) to carry out the multiplication of a pair of complex numbers with accumulation of a third complex number.
 12. The apparatus of claim 11 wherein the MPYCXA instruction completes execution in 2 cycles.
 13. The apparatus of claim 12 wherein the MPYCXA instruction is tightly pipelineable.
 14. The apparatus of claim 1 further comprising one or more of the following additional instructions (MPYCXA, MPYCXAD2, MPYCXJA or MPYCXJAD2) stored in the memory to carry out complex multiplication operations pipelined in 2 cycles.
 15. A method for the computation of an FFT by a plurality of processing elements (PEs), the method comprising the steps of: loading input data from a memory into each PE in a cyclic manner; calculating a local FFT by each PE; multiplying by the twiddle factors and calculating an FFT by the cluster of PEs; and loading the FFTs into the memory.
 16. A method for the computation of a distributed FFT by an N×N processing element (PE) array, the method comprising the steps of: loading a complex number x and a corresponding twiddle factor w from a memory into each of the PEs; calculating a first product by the multiplication of the complex numbers x and w; transmitting the first product from each of the PEs to another PE in the N×N array; receiving the first product and treating it as a second product in each of the PEs; selectively adding or subtracting the first product and the second product to form a first result; calculating a third product in selected PEs; transmitting the first result or third product in selected PEs to another PE in the N×N array; selectively adding or subtracting the received values to form a second result; and storing the second results in the memory.
 17. A method for efficient computation by a 2×2 processing element (PE) array interconnected in a manifold array interconnection network, the array comprising four PEs (PE0, PE1, PE2 and PE3), the method comprising the steps of: loading a complex number x and a corresponding twiddle factor w from a memory into each of the four PEs, complex number x including subparts x0, x1, x2 and x3, twiddle factor w including subparts w0, w1, w2 and w3; multiplying the complex numbers x and w, such that PE0 multiplies x0 and w0 to produce a product0, PE1 multiplies x1 and w1 to produce a product1, PE2 multiplies x2 and w2 to produce a product2, and PE3 multiplies x3 and w3 to produce a product3; transmitting the product0, the product1, the product2 and the product3, such that PE0 transmits the product0 to PE2, PE1 transmits the product1 to PE3, PE2 transmits the product2 to PE0, and PE3 transmits the product3 to PE1; and performing arithmetic logic operations, such that PE0 adds the product0 and the product2 to produce a sum t0, PE1 adds the product1 and the product3 to produce a sum t2, PE2 subtracts the product2 from the product0 to produce a sum t1, and PE3 subtracts the product3 from the product1 to produce a result which is multiplied by −i to produce a sum t3.
 18. The method of claim 17 further comprising the steps of: transmitting the sums t0, t1, t2 and t3, such that PE0 transmits t0 to PE1, PE1 transmits t2 to PE0, PE2 transmits t1 to PE3, and PE3 transmits t3 to PE2; performing the arithmetic logic operations, such that PE0 adds t0 and t2 to produce a y0, PE1 subtracts t2 from t0 to produce a y1, PE2 adds t1 and t3 to produce a y2, and PE3 subtracts t3 from t1 to produce a y3; and storing y0, y1, y2 and y3 in a memory.
 19. A special hardware instruction for handling the multiplication with accumulate for two complex numbers from a source register whereby utilizing said instruction an accumulated complex product of two source operands is rounded according to a rounding mode specified in the instruction and loaded into a target register with the complex numbers organized in the source such that a halfword (H1) contains the real component and a halfword (H0) contains the imaginary component.
 20. The special hardware instruction of claim 19 wherein the accumulated complex product is divided by two before it is rounded.
 21. An apparatus to efficiently fetch instructions including complex multiplication instructions and an accumulate form of multiplication instructions from a memory element and dispatch the fetched instruction to at least one of a plurality of multiply complex and multiply with accumulate execution units to carry out the instruction specified operation, the apparatus comprising: a memory element; means for fetching said instructions from the memory element; a plurality of multiply complex and multiply with accumulate execution units; and means to dispatch the fetched instruction to at least one of said plurality of execution units to carry out the instruction specified operation.
 22. The apparatus of claim 21 further comprising: an instruction register to hold a dispatched multiply complex instruction (MPYCX); means to decode the MPYCX instruction and control the execution of the MPYCX instruction; two source registers each holding a complex number as operand inputs to the multiply complex execution hardware; four multiplication units to generate terms of the complex multiplication; four pipeline registers to hold the multiplication results; an add function which adds two of the multiplication results from the pipeline registers for the imaginary component of the result; a subtract function which subtracts two of the multiplication results from the pipeline registers for the real component of the result; a round and select unit to format the real and imaginary results; and a result storage location for saving the final multiply complex result, whereby the apparatus is operative for the efficient processing of multiply complex computations.
 23. The apparatus of claim 21 wherein the means for fetching said instructions is a sequence processor (SP) controller.
 24. The apparatus of claim 22 wherein the round and select unit provides a shift right as a divide by 2 operation for a multiply complex divide by 2 instruction (MPYCXD2).
 25. The apparatus of claim 21 further comprising: an instruction register to hold a dispatched multiply complex instruction (MPYCXJ); means to decode the MPYCXJ instruction and control the execution of the MPYCXJ instruction; two source registers each holding a complex number as operand inputs to the multiply complex execution hardware; four multiplication units to generate terms of the complex multiplication; four pipeline registers to hold the multiplication results; an add function which adds two of the multiplication results from the pipeline registers for the real component of the result; a subtract function which subtracts two of the multiplication results from the pipeline registers for the imaginary component of the result; a round and select unit to format the real and imaginary results; and a result storage location for saving the final multiply complex conjugate result, whereby the apparatus is operative for the efficient processing of multiply complex conjugate computations.
 26. The apparatus of claim 25 wherein the round and select unit provides a shift right as a divide by 2 operation for a multiply complex conjugate divide by 2 instruction (MPYCXJD2).
 27. The apparatus of claim 21 further comprising: an instruction register to hold the dispatched multiply accumulate instruction (MPYA); means to decode the MPYA instruction and control the execution of the MPYA instruction; two source registers each holding a source operand as inputs to the multiply accumulate execution hardware; at least two multiplication units to generate two products of the multiplication; at least two pipeline registers to hold the multiplication results; at least two accumulate operand inputs to the second pipeline stage accumulate hardware; at least two add functions which each adds the results from the pipeline registers with the third accumulate operand creating two multiply accumulate results; a round and select unit to format the results if required by the MPYA instruction; and a result storage location for saving the final multiply accumulate result, whereby the apparatus is operative for the efficient processing of multiply accumulate computations.
 28. The apparatus of claim 21 further comprising: an instruction register to hold a dispatched multiply accumulate instruction (SUM2PA); means to decode the SUM2PA instruction and control the execution of the SUM2PA instruction; at least two source registers each holding a source operand as inputs to the SUM2PA execution hardware; at least two multiplication units to generate two products of the multiplication; at least two pipeline registers to hold the multiplication results; at least one accumulate operand input to the second pipeline stage accumulate hardware; at least one add function which adds the results from the pipeline registers with the third accumulate operand creating a SUM2PA result; a round and select unit to format the results if required by the SUM2PA instruction; and a result storage location for saving the final result, whereby the apparatus is operative for the efficient processing of sum of 2 products accumulate computations.
 29. The apparatus of claim 21 further comprising: an instruction register to hold the dispatched multiply complex accumulate instruction (MPYCXA); means to decode the MPYCXA instruction and control the execution of the MPYCXA instruction; two source registers each holding a complex number as operand inputs to the multiply complex accumulate execution hardware; four multiplication units to generate terms of the complex multiplication; four pipeline registers to hold the multiplication results; at least two accumulate operand inputs to the second pipeline stage accumulate hardware; an add function which adds two of the multiplication results from the pipeline registers and also adds one of the accumulate operand inputs for the imaginary component of the result; a subtract function which subtracts two of the multiplication results from the pipeline registers and also adds the other accumulate operand input for the real component of the result; a round and select unit to format the real and imaginary results; and a result storage location for saving the final multiply complex accumulate result, whereby the apparatus is operative for the efficient processing of multiply complex accumulate computations.
 30. The apparatus of claim 29 wherein the round and select unit provides a shift right as a divide by 2 operation for a multiply complex accumulate divide by 2 instruction (MPYCXAD2).
 31. The apparatus of claim 21 further comprising: an instruction register to hold the dispatched multiply complex conjugate accumulate instruction (MPYCXJA); means to decode the MPYCXJA instruction and control the execution of the MPYCXJA instruction; two source registers each holding a complex number as operand inputs to the multiply complex accumulate execution hardware; four multiplication units to generate terms of the complex multiplication; four pipeline registers to hold the multiplication results; at least two accumulate operand inputs to the second pipeline stage accumulate hardware; an add function which adds two of the multiplication results from the pipeline registers and also adds one of the accumulate operand inputs for the real component of the result; a subtract function which subtracts two of the multiplication results from the pipeline registers and also adds the other accumulate operand input for the imaginary component of the result; a round and select unit to format the real and imaginary results; and a result storage location for saving the final multiply complex conjugate accumulate result, whereby the apparatus is operative for the efficient processing of multiply complex conjugate accumulate computations.
 32. The apparatus of claim 31 wherein the round and select unit provides a shift right as a divide by 2 operation for a multiply complex conjugate accumulate divide by 2 instruction (MPYCXJAD2).
 33. The apparatus of claim 21 wherein the complex multiplication instructions and accumulate form of multiplication instructions include MPYCX, MPYCXD2, MPYCXJ, MPYCXJD2, MPYCXA, MPYCXAD2, MPYCXJA, MPYCXJAD2 instructions, and all of these instructions complete execution in 2 cycles.
 34. The apparatus of claim 21 wherein the complex multiplication instructions and accumulate form of multiplication instructions include MPYCX, MPYCXD2, MPYCXJ, MPYCXJD2, MPYCXA, MPYCXAD2, MPYCXJA, MPYCXJAD2 instructions, and all of these instructions are tightly pipelineable.
 35. An apparatus for the efficient processing of an FFT, the apparatus comprising: at least one controller sequence processor (SP); a plurality of processing elements (PEs) arranged in an N×N array interconnected in a manifold (ManArray) interconnection network; and a memory for storing instructions to be processed by the SP and by the array of PEs.
 36. The apparatus of claim 22 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.
 37. The apparatus of claim 25 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.
 38. The apparatus of claim 29 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.
 39. The apparatus of claim 31 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.
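By way of illustration only, the sequence of steps recited in claims 17 and 18 for a 2×2 PE array can be simulated in software as sketched below. The function names are hypothetical, the four PEs are modeled as list positions rather than as hardware, and floating-point complex arithmetic stands in for the fixed-point instructions. With all twiddle factors equal to one, and assuming the usual forward DFT sign convention, the outputs y0, y1, y2 and y3 equal the 4-point DFT terms of indices 0, 2, 1 and 3, respectively.

```python
import cmath


def fft4_2x2(x, w):
    """Follow the steps of claims 17 and 18 on a simulated 2x2 PE array."""
    # Each PE multiplies its local sample by its local twiddle factor.
    p = [x[k] * w[k] for k in range(4)]
    # First exchange: PE0 and PE2 swap products, PE1 and PE3 swap products.
    t0 = p[0] + p[2]             # PE0
    t2 = p[1] + p[3]             # PE1
    t1 = p[0] - p[2]             # PE2
    t3 = -1j * (p[1] - p[3])     # PE3
    # Second exchange: PE0 and PE1 swap sums, PE2 and PE3 swap sums.
    y0 = t0 + t2                 # PE0
    y1 = t0 - t2                 # PE1
    y2 = t1 + t3                 # PE2
    y3 = t1 - t3                 # PE3
    return [y0, y1, y2, y3]


def dft(values):
    """Direct O(n^2) DFT, used only as a reference for comparison."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

Comparing `fft4_2x2(x, [1, 1, 1, 1])` against `dft(x)` with indices permuted as 0, 2, 1, 3 is one way to check the exchange pattern.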